Data curation system with version control for workflow states and provenance

ABSTRACT

A data curation system that includes various methods to enable efficient reuse of human and machine effort. To reuse effort, various facilities are presented that model, save, and allow the querying of provenance and state information of a curation workflow and allow for incremental, stateful transitions of the data and the metadata.

CROSS-REFERENCE TO RELATED APPLICATIONS

This utility patent application is a continuation of and claims priority from U.S. patent application Ser. No. 14/474,919, filed Sep. 2, 2014, titled “DATA CURATION SYSTEM WITH VERSION CONTROL FOR WORKFLOW STATES AND PROVENANCE” naming inventors Vladimir Gluzman Peregrine, Ihab F. Ilyas, Michael Ralph Stonebraker, Stan Zdonik, Andrew H. Palmer, Alexander Richter Pagan, Daniel Meir Bruckner, George Beskales, Aizana Turmukhametova, Tianyu Zhu, Kanak Kshetri, Jason Liu, and Nikolaus Bates-Haus, which is a continuation of and claims priority from U.S. patent application Ser. No. 14/460,145, filed Aug. 14, 2014, titled “DATA CURATION SYSTEM WITH VERSION CONTROL FOR WORKFLOW STATES AND PROVENANCE”, naming inventors Nikolaus Bates-Haus, George Beskales, Vladimir Gluzman Peregrine, Ihab F. Ilyas, Kanak Kshetri, Daniel Meir Bruckner, Andrew H. Palmer, Michael Ralph Stonebraker, Jason Liu, Aizana Turmukhametova, Tianyu Zhu, and Alexander Richter Pagan.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. Copyright 2018 Tamr, Inc.

BACKGROUND OF THE INVENTION

Field of the Invention

This invention relates to cleaning, transforming, integrating, and deduplicating data from multiple data sources. More specifically, the invention is a data curation system, including various methods to enable efficient reuse of human and machine effort. To reuse effort, various facilities are presented that model, save, and allow the querying of provenance and state information of a curation workflow and allow for incremental, stateful transitions of the data and the metadata. Products and services embodying the invention operate in markets including data cleaning, record deduplication, data integration, data quality, and data transformation.

Background

Systems such as those provided by Informatica, Oracle's Silver Creek Systems, and IBM InfoSphere QualityStage are used to integrate data coming from different data sources, standardize data formats (e.g., dates and addresses), and remove errors from data (e.g., duplicates). These systems typically depend on a data expert (i.e., a human that has knowledge about the semantics of the data) to manually specify low-level procedures to clean the data. Coming up with an efficient and effective data integration plan mainly depends on the skills of the data expert. The audience targeted by such systems is assumed to be extremely familiar with the data (e.g., experienced in data analytics).

Two major challenges facing such systems are scale and state.

Regarding scale. Existing systems do not scale to the sizes of problems currently found in the field. For example, one web aggregator requires the data curation of 80,000 URLs, and a biotech company has the problem of curating 8,000 spreadsheets. At this scale, data curation cannot be a manual (human) effort, but it must entail machine-learning approaches with a human assist only when necessary. Existing systems involve a large amount of manual effort (e.g., selecting which machine learning algorithm to use, what training data to collect, what candidate generation criteria to use, etc.). Also, existing systems assume that the user is extremely familiar with the data, which is not necessarily the case in practice.

Regarding state. Data integration workflow is an iterative process. For example, in a medical database, if one data source includes a field called “room number” and a second data source also includes a field called “room number,” then a data curator (i.e. system operator) may make the initial decision that the first field and second field contain the same data. Later, the system operator may learn that the first field referred to “doctor's room number” and the second field referred to “patient's room number,” so the initial decision about these fields was incorrect. In the interim, however, other actions (such as record deduplication and schema mapping) would have been taken on the data, actions which may or may not need to be undone. The system operator now needs to go back in time to a previous version of the data, understand which decisions were made and why, reuse previous man and machine efforts where possible, and consider the implications of decisions on the future state of the data. At each state, metadata, such as what decisions were made and why, exist but are not necessarily tracked. As can be appreciated, multiple versions (e.g. parent and child) and paths (or branches) are possible, but current systems do not provide for efficient version tracking, management, or control.

DESCRIPTION OF PRIOR ART

U.S. Patents

U.S. Pat. No. 7,970,630 (issued 2011 Jun. 28, name Fagan et al., title “INTEGRATED BIOMEDICAL INFORMATION PORTAL SYSTEM AND METHOD”) discloses, in the Abstract, “A computer-implemented system and method for integrating data from a plurality of biomedical development phases. The system and method include a database that stores data collected from the biomedical development phases. The database further includes a metadata data structure that describes the data collected during a biomedical development phase. At least one graphical user interface collects data during the biomedical development phase. The structure of the graphical user interface is defined based at least in part upon the metadata data structure so that the graphical user interface collects data points as well as metadata that is to be stored within the metadata data structure. The metadata describes the collected data points, and at least a portion of the metadata data structure is determined based upon an issue that arises in a subsequent biomedical development phase.”

A system for storing in one place the metadata and data related to a treatment in development by a pharmaceutical company or similar enterprise. The system stores raw data, metadata, and genomic information. It assists with data entry and with making the data and metadata available to the right people at the right times. However, it is primarily a storage and retrieval system. This system does not enable its users to enrich the data in any significant way, nor does it provide any general-purpose enrichment tools.

U.S. Patent Application Publications

United States Patent Application Publication 2009/0138415 (published 2009 May 28, name Lancaster, title “AUTOMATED RESEARCH SYSTEMS AND METHODS FOR RESEARCHING SYSTEMS”) discloses, in the Abstract, “Systems and methods that provide for automated research into the workings of one or more studied systems include automated research software modules that communicate with domain knowledge bases, research professionals, automated laboratories experiment objects, and data analysis processes, wherein automatically selected experiment objects can be run at an automated laboratory to produce experimental results, and the subsequent data-processing providing automated guidance to a next round of experiment choice and automated research. An Experiment Director rules engine chooses Experiment Objects based on user input through a Query Manager.”

A system for self-guided research. Essentially, under loose supervision this system explores the parameters of some real-world complex system, such as the Earth's climate or a human cell, and attempts to draw conclusions. This system improves the throughput of automated experiment frameworks such as cellular assays by providing quick decisions of which experiments might be done next to maximize the amount learned from the experiments. The degree of interaction with humans seems to be limited to providing some initial hints about which parameters might be worth investigating. It does not involve a human curator or human experts, nor enable them to save time and reuse past work.

United States Patent Application Publication 2010/0228699 (published 2010 Sep. 9, name Webber et al., title “SYSTEM AND METHOD FOR INTERACTING WITH CLINICAL TRIAL OPERATIONAL DATA”) discloses, in the Abstract, “A method and system for exchanging clinical trial operational data by using a centralized shared server system connected to a plurality of shared servers. The system and method manage a plurality of clinical trial-related applications by creating a plurality of tables stored within the shared database of the shared database system connected to a centralized shared server system within a virtual network for updating and sharing among clinical trials. The current system and method allow exchanging clinical trial operational data between a centralized shared server system and a plurality of shared servers to delegate responsibility to other clinical trial organization users for producing subsets of clinical trial operational data with limited data access rights. The current system and method allow assigning data access rights to other clinical trial organizations by configuring the at least one other clinical trial organization as either a producer or a consumer of the clinical trial operational data for limiting access to the at least one table with the clinical trial operational data by the at least one other clinical trial organization. The current system and method allow each business partner to manage the assigned responsibilities by using existing clinical trial management systems applications and to maintain views of other clinical trial organizations activities of clinical trial operational data subject to assigned data access rights.”

This system is mainly about sharing and security in managing clinical trials data and ensuring the appropriate people, and only the appropriate people, are able to see the data easily. There is no functionality for proposing enhancements or links in the data, nor any curation capabilities.

United States Patent Application Publication 2013/0091170 (published 2013 Apr. 11, name Zhang et al., title “MULTI-MODALITY, MULTI-RESOURCE, INFORMATION INTEGRATION ENVIRONMENT”) discloses, in the Abstract, “A multi-modality, multi-resource, information integration environment system is disclosed that comprises: (a) at least one computer readable medium capable of securely storing and archiving system data; (b) at least one computer system, or program thereon, designed to permit and facilitate web-based access of the at least one computer readable medium containing the secured and archived system data; (c) at least one computer system, or program thereon, designed to permit and facilitate resource scheduling or management; (d) at least one computer system, or program thereon, designed to monitor the overall resource usage of a core facility; and (e) at least one computer system, or program thereon, designed to track regulatory and operational qualifications.”

A system for coordinated presentation and management of scientific and administrative data in the field of biomedical research. This system does not enrich the data in any way, and finds no links except those given to it by its operators or revealed by trivial full-text search. It also manages a set of workflows, but does not in any way allow users to reuse their efforts across changes in context.

None of the above provides a system with:

(a) methods for workflow creation and modeling, including:

    (i) defining curation actions, decisions, and data states; and

    (ii) details of the techniques used in modeling transitions, and other lineage information presented in a provenance language that links curation states and human/machine actions to specific state transitions;

(b) methods for workflow manipulation;

(c) methods for mining semantic dependency among curation actions and object linkage decisions; and

(d) methods for responding to a specific change and for using the dependency among the previous curation actions to identify reusable curation actions and metadata.

What is needed, therefore, is a system that overcomes the above-mentioned limitations and that includes the features enumerated above.

BRIEF SUMMARY OF THE INVENTION

The invention is a data curation system that includes various methods to enable efficient reuse of human and machine effort. To reuse effort, various facilities are presented that model, save, and allow the querying of provenance and state information of a curation workflow and allow for incremental, stateful transitions of the data and the metadata.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level depiction of the subsystems of the current system.

FIG. 2 is a system diagram showing hardware components of the system.

FIG. 3 illustrates basic state history.

FIG. 4 illustrates state branching.

FIG. 5 illustrates state branch merging.

FIG. 6 illustrates state branch rebase.

DETAILED DESCRIPTION OF THE INVENTION, INCLUDING THE PREFERRED EMBODIMENT

In the following detailed description of the invention, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be used, and structural changes may be made without departing from the scope of the present invention.

Overview

Data integration is (a) mapping schemas of multiple data sources into one global schema, and (b) deduplicating records in such sources. In other words, data integration involves two object linkage exercises: column/field/attribute linkage and row/record linkage.

Data curation is the broader act of (a) discovering a data source of interest, cleaning and transforming the new data; (b) semantically integrating it (as above) with other local data sources; and (c) deduplicating the resulting composite. Data curation includes schema mapping, record deduplication, transformation, etc.

Referring now to FIG. 1, a high-level depiction of the current system is shown. The system has the following major components:

Curation Process Module 110;

State Creation And Manipulation Module 200;

Curation States And Provenance Datastore 220;

Update Handler Module 230; and

Human Players, namely

    System Operator 130; and

    Data Experts 120.

Curation Process Module 110.

Curation Process Module 110 is a subsystem of the overall system described herein, and this subsystem is described in more detail in U.S. patent application Ser. No. 14/228,546 “METHOD AND SYSTEM FOR LARGE SCALE DATA CURATION” (Bates-Haus et al., filed 2014 Mar. 28). This subsystem “allows integrating a large number of data sources by normalizing, cleaning, integrating, and deduplicating these data sources. The [sub]system makes a clear separation between the system operator, who is responsible for operating the system (e.g., ingesting data sources, triggering data integration tasks), and data experts who have enough expertise to answer specific questions about the data.” [Para. 0020] This subsystem “abstracts schema mapping and record deduplication as object linkage, where an object could refer to a column (i.e., field/attribute) or a row (i.e., a record) in a data source. Such abstraction allows seamless data integration between solutions of both problems. This data integration allows solving the two problems in a holistic way rather than one problem at a time.” [Para. 0020]

Curation Process Module 110 continuously operates on data, taking raw data (not shown) as input, cleaning it, transforming it, semantically integrating it with other data, and deduplicating the resulting composite. A version of the output, at any given point in time, includes both the data (as currently curated) and metadata (which represents the curation state of the underlying data). Curation state includes details about when and to what extent data curation has occurred (for example, whether or not two objects in the data have been linked) and why (provenance).

Provenance will be discussed further below.

Curation Process Module 110 combines its own machine analysis with input from users (namely, Data Experts 120 and System Operator 130) to propose curation state changes (including linkage changes, transformations, etc.) as one or more Curation Proposal 150 to System Operator 130. For example, when Curation Process Module 110 outputs one or more Curation Proposal 150 to System Operator 130, then System Operator 130 must decide whether or not to issue one or more corresponding Curation Approval 160 to Curation Process Module 110. If Curation Proposal 150 is approved by System Operator 130 (as Curation Approval 160), then Curation Approval 160 is implemented by Curation Process Module 110, thereby making the approved proposal(s) part of a new linkage state.

Actions that can be taken by System Operator 130 will be discussed further below.

State Creation and Manipulation Module 200.

State Creation and Manipulation Module 200 takes as input a set of deltas and provenance information for each delta. State Creation and Manipulation Module 200 creates and outputs a new system state and appropriate provenance information (collectively New States And Provenance 210).

More specifically, when any change (i.e. delta) has been made to curation state, Curation Process Module 110 outputs State Changes And Provenance 180 to State Creation And Manipulation Module 200. State Changes And Provenance 180 includes metadata, namely state change metadata (e.g. when and to what extent data curation has occurred, how and to what extent the child state differs from the parent state) and provenance metadata (e.g. why a particular change occurred). Provenance metadata includes machine-processable information describing why something is considered true. An example of provenance metadata could also be a notes field indicating that two different fields in the data should no longer be linked because, for example, they do not contain the same data (such as in the “doctor's room number” and “patient's room number” example above). State Creation And Manipulation Module 200 then outputs updated metadata as New States And Provenance 210, which is stored in Curation States And Provenance Datastore 220.

Curation States and Provenance Datastore 220.

Curation States and Provenance Datastore 220 records the history of curation states, as well as the details of each curation state and the provenance of all elements in each curation state. Curation States and Provenance Datastore 220 supports Structured Queries 140 from (and outputs to) Update Handler Module 230.

Update Handler Module 230.

Update Handler Module 230 processes provenance information to understand how changes to curation state affect existing curation state elements. Update Handler Module 230 proposes further changes (as Update Proposal 190) to System Operator 130 to enable consistent provenance, while re-doing as little human work as possible. Update Handler Module 230 can also communicate directly with Curation Process Module 110 (connection not shown) as discussed further below.

Human Players (Curator and Experts).

A curator, System Operator 130, drives the data curation effort. System Operator 130 initiates curation data actions, initiates and approves all state changes in the system, and supervises the integration of machine judgment (from Curation Process Module 110) and human judgment (from Data Experts 120).

Data Experts 120 are the humans whose guidance enables Curation Process Module 110 to make proposals. Data Experts 120 supply the ground truth insight that enables Curation Process Module 110 to function.

Referring now to FIG. 2, a system diagram showing hardware components of the system is presented. Storage/Compute Tier 340 is where all the state is stored and where all the data-scale computations take place. Storage/Compute Tier 340 can be a large-scale traditional RDBMS such as Vertica or Oracle, or it can be a Hadoop cluster, communication with which happens in SQL. Orchestrator Tier 330 can share hardware with Storage/Compute Tier 340 or it can be implemented on separate hardware. If separate, Orchestrator Tier 330 can be run on commodity application server hardware. Orchestrator Tier 330 is where the business logic executes and where human-scale operations take place. Modern web browsers (Web Browser 310 and Web Browser 320) are used to interface users (System Operator 130 and Data Experts 120, respectively) with the application (via Orchestrator Tier 330).

Operation

Types of Curation Actions and Control Flow.

Referring now to the interaction between System Operator 130 and Curation Process Module 110, there are two types of actions that can be taken by System Operator 130:

1. Curation Data Action 170 instructs Curation Process Module 110 to perform curation (e.g. data loading, transformation, or linkage).

2. State History Action 290 involves the direct manipulation of the state history (e.g. back up to a previous state and start a new branch from there, merge two independent streams of work, re-apply an action from another branch of work) via State Creation And Manipulation Module 200.

A Curation Data Action 170 goes through several phases, described as follows.

Phase 1: Action Initiation.

Curation Data Action 170 is initiated by System Operator 130, or by Curation Process Module 110 at the prior authorization of System Operator 130 (e.g. via a scheduled task). Curation Data Action 170 involves the invocation of one or more of the curation processors available in the system. System Operator 130 (optionally via a graphical computer user interface) provides the system with a definition of which processors to invoke and how to configure them. At this time, System Operator 130 may also provide a dependency processing mode (DPM), or, in preview mode, System Operator 130 may choose to see which state element(s) will be invalidated by the action before deciding on which DPM to use for the action.

Phase 2: Action Processing.

Curation Process Module 110 is configured and invoked against the current curation state and then produces an initial set of changes (State Changes And Provenance 180) to be applied to the curation state in order to make a new curation state. In addition to the deltas (i.e. changes), Curation Process Module 110 may produce a set of suggestions (Curation Proposal 150) for further changes to be applied. With each suggestion may be included a confidence, as for example produced by a linkage classifier. Data Experts 120 may be queried at the discretion of System Operator 130 in the generation of suggestions.

Phase 3: Suggestion Processing.

Any suggestions generated in the Action Processing phase are presented to System Operator 130 for feedback. System Operator 130 may examine individual suggestions and accept (as Curation Approval 160) or reject them. System Operator 130 may also accept or reject suggestions in bulk by providing selection criteria for these suggestions, such as “accept all with confidence above 70%.” This process continues until all suggestions are accepted or rejected. The changes represented by any accepted suggestions are added to the set of Candidate Changes 240 to be applied. Candidate Changes 240 are a machine-readable form of Curation Proposal 150, formatted for processing by Update Handler Module 230.
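By way of illustration only, the bulk acceptance described above might be sketched as follows. The `Suggestion` type, the `bulk_review` function, and the attribute names in the sample data are hypothetical and do not appear in the disclosure; the sketch assumes each suggestion carries a numeric confidence between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical stand-in for one linkage suggestion in a Curation Proposal."""
    left: str
    right: str
    confidence: float


def bulk_review(suggestions, accept_above):
    """Split suggestions into accepted and rejected by a confidence threshold.

    Mirrors a selection criterion such as "accept all with confidence
    above 70%"; everything at or below the threshold is rejected.
    """
    accepted = [s for s in suggestions if s.confidence > accept_above]
    rejected = [s for s in suggestions if s.confidence <= accept_above]
    return accepted, rejected


suggestions = [
    Suggestion("src_a.room_number", "src_b.room_number", 0.92),
    Suggestion("src_a.name", "src_b.full_name", 0.75),
    Suggestion("src_a.dob", "src_b.zip", 0.10),
]

# Accepted suggestions become the candidate changes for the next phase.
candidate_changes, rejected = bulk_review(suggestions, accept_above=0.70)
```

In this sketch every suggestion ends up either accepted or rejected in a single pass, which corresponds to the stopping condition stated above.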

Phase 4: Dependency Identification.

The set of Candidate Changes 240 is sent to Update Handler Module 230, which computes the set of propagated changes as Update Proposal 190 (i.e. the changes that would result from propagating dependencies from Candidate Changes 240). Update Proposal 190 includes the invalidated state elements (i.e. any state elements invalidated by Candidate Changes 240).

Phase 5: Dependency Review.

If System Operator 130 has not yet provided a dependency processing mode (DPM), then Update Handler Module 230 presents Update Proposal 190 to System Operator 130 for review, and System Operator 130 selects a DPM to use. Optionally, even if System Operator 130 had previously selected a DPM, when certain pre-specified conditions are met (e.g. number of invalidated changes exceeds some threshold), Update Proposal 190 is presented for review and System Operator 130 is given the opportunity to specify a different DPM to use. Based on the DPM, Update Handler Module 230 will decide whether the change should be approved (i.e. allowed to go forward) or rejected. Each proposal may have metadata such as a confidence associated with it in order to allow bulk processing of proposals. The changes approved by System Operator 130 as Update Approvals 280 are Final Changes 250.

Phase 6: New State Creation.

Once Update Approvals 280 have been gathered, Update Handler Module 230 sends Final Changes 250 to State Creation And Manipulation Module 200, which adds them to the original set of Candidate Changes 240 and creates a new curation state by applying these changes to whatever is the current curation state. The current state pointer is then updated to the newly created state, and the new state is made a child of what was the current state at the beginning of the operation.
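As a minimal sketch of this phase, and not the implementation of the disclosed module, the following assumes that a curation state can be reduced to a set of fact strings and that every change is an addition. The `CurationState` and `StateStore` names and the fact strings are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class CurationState:
    """Immutable state: once created, it is never modified."""
    state_id: int
    parents: Tuple[int, ...]
    facts: FrozenSet[str]


class StateStore:
    """Hypothetical store holding all states plus a current-state pointer."""

    def __init__(self):
        self.states = {1: CurationState(1, (), frozenset())}
        self.current = 1

    def create_state(self, candidate_changes, final_changes):
        # Combine the two change sets into one batch (additions only here).
        batch = frozenset(candidate_changes) | frozenset(final_changes)
        parent = self.states[self.current]
        new_id = max(self.states) + 1
        # The new state is a child of whatever was current at the start.
        child = CurationState(new_id, (parent.state_id,), parent.facts | batch)
        self.states[new_id] = child
        # Move the current-state pointer to the newly created state.
        self.current = new_id
        return child


store = StateStore()
state_2 = store.create_state({"link:a~b"}, {"link:b~c"})
```

Because the dataclass is frozen and the facts are held in a `frozenset`, the parent state is left untouched, which matches the rule that existing states cannot be modified.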

Curation State.

A curation state consists of the following elements:

1. One or more data sources (not shown);

2. Object linkage facts (not shown) between records or attributes in the data sources;

3. Linkage system state (not shown), including any training data, models, signals, and intermediate state computed during the generation of linkage suggestions;

4. Linkage Questions 260 posed to Data Experts 120; and

5. Linkage Opinions 270 from Data Experts 120 given in response to Linkage Questions 260.

Each data source consists of a number of records. Each record is a collection of key-value pairs, with any key appearing zero or more times. A key present on any record of a data source is an attribute associated with the data source of which the record is a part. Attributes that are associated with different data sources are distinct. Records that are part of different data sources are distinct.

Each data source or attribute may have arbitrary structured metadata associated with it.
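One possible in-memory representation of this data model is sketched below. It assumes a record is a list of (key, value) pairs, so that a key may repeat, and that an attribute is identified by its data source together with its key, so that attributes of different data sources stay distinct. The function name and the sample sources `crm` and `erp` are hypothetical.

```python
def attributes_of(source_name, records):
    """Return the attributes of a data source as (source, key) pairs.

    A key that appears on any record of the source counts as an attribute
    of that source; pairing it with the source name keeps attributes of
    different sources distinct even when the keys are spelled the same.
    """
    return {(source_name, key) for record in records for key, _ in record}


# A record is a list of (key, value) pairs so a key may appear several times.
crm_records = [
    [("phone", "555-0100"), ("phone", "555-0101"), ("name", "Ann")],
    [("name", "Bob")],  # "phone" appears zero times on this record
]
erp_records = [[("name", "Ann B.")]]

crm_attributes = attributes_of("crm", crm_records)
erp_attributes = attributes_of("erp", erp_records)
```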

Object linkage facts come in two varieties:

1. Attribute linkage (i.e. schema mapping for columns/fields/attributes); and

2. Record linkage (i.e. record de-duplication for rows/records).

A linkage fact indicates that two objects (i.e. attributes or records, which may be from the same or different data sources) are linked.

Object linkage facts are transitive. Thus, all attributes and records within the system can be divided into a set of connected components. We call the attribute-connected components “derived attributes” and the record-connected components “derived records.” Thus, the curation state implies an integrated derived view of all of the data in the system as a data source, with derived attributes acting as attributes and derived records acting as records. In many (if not most) applications of data curation, the derived data is the ultimate goal of System Operator 130, as it is the derived data that will be used in downstream analysis.
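Because linkage facts are transitive, derived records (or derived attributes) can be computed as connected components. The sketch below uses a union-find structure, which is one common technique for this and is not stated in the disclosure to be the method used; the function name and record identifiers are hypothetical.

```python
def connected_components(objects, linkage_facts):
    """Group objects into connected components using union-find."""
    parent = {obj: obj for obj in objects}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linkage_facts:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = {}
    for obj in objects:
        groups.setdefault(find(obj), set()).add(obj)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


records = ["r1", "r2", "r3", "r4"]
facts = [("r1", "r2"), ("r2", "r3")]  # r1~r2 and r2~r3 imply r1~r3
derived_records = connected_components(records, facts)
```

Here `r1`, `r2`, and `r3` collapse into one derived record even though no fact links `r1` to `r3` directly, while the unlinked `r4` forms a derived record of its own.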

State Creation.

State Creation And Manipulation Module 200 is responsible for creating new states. States that already exist cannot be modified. At any given point in time, one of the states is marked as the current curation state, and states created via actions (e.g. via Curation Data Action 170) will be children of the current curation state. In the figures (FIG. 3, FIG. 4, FIG. 5, and FIG. 6), State 2 (created by Action 1) is the child of State 1.

State Creation And Manipulation Module 200 processes batches of state change commands and creates one new state per batch. See below for how different states are linked via provenance and history. The state change commands include:

1. Add Data: Load a data source or add data to an existing data source.

2. Transform Data: Transform a data source, either in place or to create a new data source.

3. Add Object Linkage Facts: Self-explanatory.

4. Add Linkage Questions 260: Self-explanatory.

5. Accept Linkage Opinions 270: Self-explanatory.

Curation State Provenance.

As System Operator 130 takes curation actions, the curation actions result in a changed curation state. For example, System Operator 130 might direct the system to do one or more of the following:

1. Create new data sources by loading them from external systems or by transformation from existing data sources.

2. Modify data sources by transformation (either creating new attributes derived from existing ones or modifying existing attributes and records).

3. Load object linkage facts into the system.

4. Generate linkage suggestions based on data signals and expert opinions, and accept some of them, resulting in new object linkage facts.

The new curation state created by a curation action will have a number of differences (deltas) from its parent state. For each of these differences, it is possible to record provenance information such as:

1. Which Curation Data Action 170 by System Operator 130 resulted in this difference?

2. Which System Operator 130 took this curation action?

3. What conditions have to hold in order for this element to remain valid? For example:

    (a) A linkage fact that was explicitly approved by System Operator 130 is valid unless System Operator 130 withdraws his/her approval.

    (b) A linkage model that was created based on some training data might remain valid while a critical mass of training facts remain valid. A critical mass might be defined as a percentage of the data, or in terms of some statistical properties of the training data.

    (c) A linkage fact that was implicitly approved by System Operator 130 as part of a bulk approval remains valid as long as it meets the criteria for the bulk approval. For example:

        (i) Confidence based on model >85%.

        (ii) Expert consensus >90%.

    (d) Combining (b) and (c) into a single scenario, if a critical number of training facts for a model becomes invalid, then any object linkage facts that were approved in bulk based on confidence scores from that model are also considered potentially invalid.
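Conditions (b) through (d) can be illustrated with a small sketch. It assumes that "critical mass" is defined as a fraction of training facts that are still valid, which is only one of the definitions mentioned above; the function names, the 0.8 fraction, and the sample facts are hypothetical, while the 85% confidence criterion is taken from the example in (c).

```python
def model_is_valid(training_facts, invalid_facts, critical_fraction=0.8):
    """Condition (b): a model stays valid while enough training facts do.

    critical_fraction is an assumed definition of "critical mass".
    """
    if not training_facts:
        return False
    still_valid = sum(1 for fact in training_facts if fact not in invalid_facts)
    return still_valid / len(training_facts) >= critical_fraction


def bulk_fact_is_valid(confidence, model_valid, threshold=0.85):
    """Conditions (c) and (d): a bulk-approved fact must still meet the
    bulk criterion, and the model that scored it must still be valid."""
    return model_valid and confidence > threshold


training = ["t1", "t2", "t3", "t4", "t5"]

# No training fact has been invalidated: the model and the fact hold.
valid_before = bulk_fact_is_valid(0.90, model_is_valid(training, set()))

# Two of five training facts withdrawn: the model falls below critical
# mass, so the bulk-approved fact becomes potentially invalid.
valid_after = bulk_fact_is_valid(0.90, model_is_valid(training, {"t1", "t2"}))
```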

Each curation element thus derives its validity either from direct approval by System Operator 130 or from some computation involving curation state elements from the parent state, which in turn may have the same relationship with states earlier in the history. Thus, the curation element provenance forms a Directed Acyclic Graph (DAG), with each node without in-edges representing a curation action (e.g. loading data, authoring a transform or approving a linkage fact), and each node with in-edges representing some individual curation state element.

Using a graph traversal, it is thus possible to determine upon which curation state elements an individual element depends, and which ones depend upon it.
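A breadth-first traversal is one way to answer both questions. The sketch below assumes the provenance DAG is stored as an adjacency mapping whose edges point from a supporting node to the element it supports; the node labels and the `supports` mapping are hypothetical.

```python
from collections import deque


def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Reverse every edge so the same traversal can walk upstream."""
    inverted = {}
    for src, dsts in edges.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


# Edges point from a supporting node to the element it supports.
supports = {
    "action:approve_training": ["fact:t1", "fact:t2"],
    "fact:t1": ["model:m1"],
    "fact:t2": ["model:m1"],
    "model:m1": ["fact:bulk_link_1"],
}

# Elements that depend upon fact:t1 (candidates for invalidation).
dependents_of_t1 = reachable("fact:t1", supports)

# Elements (and the action) upon which fact:bulk_link_1 depends.
dependencies_of_link = reachable("fact:bulk_link_1", invert(supports))
```

The node `action:approve_training` has no in-edges, so it plays the role of a curation action; all other nodes are individual curation state elements.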

Curation State History.

As noted above, each curation action creates a new curation state. Thesystem records a history of all curation states that have ever existed.This creates a directed graph where the nodes are curation states andthe edges are curation actions. Using this graph, the system supportsstandard undo/redo/branch operations, where System Operator 130 can“back up” to a previous curation state (via State History Action 290)and start working from there. The state history storage may beimplemented using various mechanisms, for example:

1. At each action, the new state may be written in its entirety separately from the previous state and associated with the action that produced it (i.e. a fully materialized storage system).

2. At each action, the differences between the new state and the previous state may be recorded and associated with the action that produced them (i.e. a delta storage system).

3. In a purely delta storage system, however, queries can become slow due to the large number of deltas to be followed, so parts of the state (or the whole state) can periodically be materialized anew to ensure responsive queries.

4. To ensure responsiveness in the face of a potentially long-running materialization, such materialization can be implemented as a background operation, with queries transitioned from the delta representation to the materialized representation once the materialization has completed.

Any of these (or other) strategies may be applied to disjoint parts of the curation state (e.g. attribute linkage may be copied wholesale, while record linkage may be stored using deltas with periodic materialization).
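
A delta storage system with periodic materialization might be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the description: the history is linear, the state is a flat map of fact names to values, a null value in a delta marks a removed fact, and a snapshot is written synchronously every `snapshotInterval` versions rather than in the background. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical delta-based store for a linear history of curation states.
class DeltaStateStore {
    private final List<Map<String, String>> deltas = new ArrayList<>();
    private final Map<Integer, Map<String, String>> snapshots = new HashMap<>();
    private final int snapshotInterval;

    DeltaStateStore(int snapshotInterval) {
        if (snapshotInterval < 1) {
            throw new IllegalArgumentException("snapshotInterval must be at least 1");
        }
        this.snapshotInterval = snapshotInterval;
    }

    // Records the changes made by one action and returns the new version number.
    int commit(Map<String, String> delta) {
        deltas.add(new HashMap<>(delta));
        int version = deltas.size() - 1;
        if (version % snapshotInterval == 0) {
            // Periodic full materialization keeps later queries short.
            snapshots.put(version, materialize(version));
        }
        return version;
    }

    // Rebuilds a version from the nearest earlier snapshot plus later deltas.
    Map<String, String> materialize(int version) {
        int start = version;
        while (start >= 0 && !snapshots.containsKey(start)) {
            start--;
        }
        Map<String, String> state = new HashMap<>();
        if (start >= 0) {
            state.putAll(snapshots.get(start));
        }
        for (int v = start + 1; v <= version; v++) {
            for (Map.Entry<String, String> change : deltas.get(v).entrySet()) {
                if (change.getValue() == null) {
                    state.remove(change.getKey()); // null marks a removed fact
                } else {
                    state.put(change.getKey(), change.getValue());
                }
            }
        }
        return state;
    }
}
```

With an interval of 1 the store behaves like the fully materialized strategy; larger intervals trade query cost for storage.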

Each individual curation state is called a version. This terminology is similar to that used with the Git (see www.git-scm.com) data model, where versions form a directed graph. In other words, just as Git software implements version control for source code, the invention implements version control for data curation.

The history graph can also provide a temporal view of dependencies. Because this is a directed graph, all states can be described as ancestors or descendants of any given state (with only the state itself being both ancestor and descendant; all other states are just one or the other). This implies another sense of provenance, with pieces of state being dependent on all previous state(s).

It is also possible for the system to assist a user in combining work from different branches of the version/action graph. In this situation, a new state may be created with two state parents, with the tips of the two branches being combined. This and other similar operations are described further below in the Update Handler Module 230 and State History Actions sections.

Curation Process Module 110.

Curation Process Module 110 generates curation state changes along with provenance information for each state element.

Linkage System.

This system generates linkage suggestions and confidences, and may use human experts and machine-learning-based classifiers to do so. See U.S. patent application Ser. No. 14/228,546 (previously discussed) for details.

Object linkage suggestions are presented to System Operator 130 for approval. System Operator 130 may approve linkage suggestions individually. Since the number of linkage suggestions is usually large, System Operator 130 may choose to approve or reject suggestions in bulk, using criteria based on the curation state, such as data filters, confidence filters, etc.

The provenance of each linkage state element has the following parts:

1. Did System Operator 130 explicitly approve this linkage element? If so, it is considered valid as long as the constituent data presented to System Operator 130 for approval remain the same. For example:

-   (a) For a record linkage fact, this means that the linked records keep the same attributes with the same values.
-   (b) For an attribute linkage fact, this means that the linked attributes keep the same values in the same records.

2. Did System Operator 130 approve this linkage element as part of a bulk approval based on some criteria? If so, then it is considered valid as long as the criteria remain true. If the criteria involve a confidence from a model, and that model becomes invalid, then the model may be recomputed, subject to the approval of System Operator 130. If the element satisfies the criteria with the new model, then the element remains valid. For example:

-   (a) Confidence >85%.
-   (b) Expert consensus >90%.
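
The two provenance checks above might be expressed as follows. This is a minimal sketch under stated assumptions: the constituent data shown to the operator is represented as a map of attribute names to values, and the two bulk criteria are assumed to apply together (the description does not say whether they are combined with "and" or "or"). The class `LinkageProvenance` and its methods are hypothetical.

```java
import java.util.Map;

// Hypothetical validity checks for the provenance of a linkage element.
class LinkageProvenance {
    // Explicit approval: valid while the data that was presented to the
    // operator at approval time is unchanged.
    static boolean explicitApprovalStillValid(Map<String, String> valuesAtApproval,
                                              Map<String, String> currentValues) {
        return valuesAtApproval.equals(currentValues);
    }

    // Bulk approval: valid while the approval criteria remain true.
    // Assumption: both thresholds must be exceeded.
    static boolean bulkApprovalStillValid(double modelConfidence,
                                          double expertConsensus,
                                          double minConfidence,
                                          double minConsensus) {
        return modelConfidence > minConfidence && expertConsensus > minConsensus;
    }
}
```

After a model is recomputed, the bulk check would simply be re-evaluated with the confidence produced by the new model.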

The provenance of internal linkage elements such as machine learning models is based on the inputs from which those models were computed, including training data and answers to training questions that were approved by System Operator 130.

Transformation System.

The transformation system allows System Operator 130 to modify the data in the data sources that are part of the curation state. The system supports a transformation language for describing how new data is to be generated from existing data. System Operator 130 may also use external tools (e.g. ETL tools) to generate new data based on the old data.

The provenance of each data element output from the transformation system is based on the inputs to the transformation process that generated the given element. As long as the elements that are inputs to the transform remain valid, the output remains valid.

Update Handler Module 230.

Update Handler Module 230 is responsible for propagating changes to curation state to ensure the provenance of all state elements in each curation state is consistent. Update Handler Module 230 identifies which facts can remain, which facts need to be removed, and which new facts need to be added. In order to do so, Update Handler Module 230 receives from System Operator 130 a Dependency Processing Mode (DPM) (not shown), whose possible values include:

1. RESTRICT: Don't allow new state creation if it will mean invalidating any existing curation state. In this mode, only operations that don't change any existing facts are allowed. Examples include loading new data and loading new curation facts. For example:

-   (a) Candidate Changes 240 include creating a new attribute in a data source as a function of two other attributes. The new attribute has no linkage to any others. This change is allowed to go forward.
-   (b) Candidate Changes 240 include reversing the linkage between two attributes. This linkage was used in the training of the record linkage model, and its removal would cause the model to no longer be valid, and, transitively, all of the object linkage facts that rely on the model's confidence outputs to be invalid. This change is rejected.
-   (c) Candidate Changes 240 include reversing linkage between two records. This linkage fact was used in the training of the record linkage model, but the change is small enough that the record linkage model remains valid. This change is allowed to go forward.

2. PROPAGATE: Use the provenance information of any elements being changed in order to compute further changes whose application would make the provenance of all elements consistent. For example:

-   (a) Candidate Changes 240 include creating a new attribute in a data source as a function of two other attributes. The new attribute has no linkage to any others. No propagation is required.
-   (b) Candidate Changes 240 include reversing linkage between two attributes. This linkage was used in the training of the record linkage model, and its removal would cause the model to no longer be valid, and, transitively, all of the object linkage facts that rely on the model's confidence outputs to be invalid. In order to propagate this change:
    -   (i) A new model is computed, potentially including a training phase where Data Experts 120 (and System Operator 130) may be shown some of the model's predictions in order to validate the model's quality.
    -   (ii) The final model's predictions are shown to System Operator 130, who approves the model.
    -   (iii) System Operator 130 may, at this point, be given the opportunity to update any approval/rejection thresholds.
    -   (iv) Any object linkage facts whose confidences under the new model are high enough to meet the bulk approval thresholds provided by System Operator 130 remain facts. Any facts whose confidences are lower than the rejection thresholds are reversed (recorded explicitly to be false).
    -   (v) System Operator 130 may be given an opportunity to select some linkage proposals for manual review by Data Experts 120, and direct approval by System Operator 130.

3. OVERRIDE: Any state elements rendered invalid by the changes are considered approved by System Operator 130. For example, Candidate Changes 240 include reversing linkage between two attributes. This linkage was used in the training of the record linkage model, and its removal would cause the model to no longer be valid, and, transitively, all of the object linkage facts that rely on the model's confidence outputs to be invalid. The model's provenance is updated to include explicit approval by System Operator 130.
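
The three modes can be summarized in a small dispatch sketch. It assumes that the set of elements invalidated by the candidate changes has already been computed, and it only decides what to do with that set; the actual recomputation and operator interaction described for PROPAGATE are not modeled. The names `DependencyProcessingMode`, `UpdateDecision`, and `UpdateHandlerSketch` are hypothetical.

```java
import java.util.List;

enum DependencyProcessingMode { RESTRICT, PROPAGATE, OVERRIDE }

// Hypothetical outcome of applying a mode to a set of invalidated elements.
class UpdateDecision {
    final boolean accepted;                     // may the new state be created?
    final List<String> toRecompute;             // elements needing propagated changes
    final List<String> toMarkOperatorApproved;  // elements given explicit approval

    UpdateDecision(boolean accepted, List<String> toRecompute,
                   List<String> toMarkOperatorApproved) {
        this.accepted = accepted;
        this.toRecompute = toRecompute;
        this.toMarkOperatorApproved = toMarkOperatorApproved;
    }
}

class UpdateHandlerSketch {
    static UpdateDecision decide(DependencyProcessingMode mode,
                                 List<String> invalidatedElements) {
        switch (mode) {
            case RESTRICT:
                // Reject the change if anything existing would be invalidated.
                return new UpdateDecision(invalidatedElements.isEmpty(),
                        List.of(), List.of());
            case PROPAGATE:
                // Accept, and compute further changes for the invalidated elements.
                return new UpdateDecision(true, invalidatedElements, List.of());
            case OVERRIDE:
                // Accept, and record operator approval on the invalidated elements.
                return new UpdateDecision(true, List.of(), invalidatedElements);
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```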

Calculating Propagated Changes.

Given a set of proposed changes to be applied to a curation state in order to create a new curation state, the invalidated dependencies can be computed by calculating the set of descendants of all of the proposed changes in the provenance DAG. A computer software procedure to do this might look like this:

    public List<StateElement> propagateChanges(List<StateElement> proposedChanges) {
      // descendantsOf uses the provenance DAG
      List<StateElement> descendantsOfChanges = descendantsOf(proposedChanges);
      // Topological sort to ensure all ancestors of an element are considered
      // before the element.
      // The topological sort uses the provenance DAG
      List<StateElement> possiblyAffectedElements = topologicalSort(descendantsOfChanges);
      List<StateElement> noLongerValidElements = new ArrayList<>();
      for (StateElement element : possiblyAffectedElements) {
        if (isElementStillValid(element)) {
          // skip links from elements that are not invalidated
          continue;
        }
        noLongerValidElements.add(element);
        for (StateElement child : element.getDirectDescendants()) {
          updateProvenance(child);
        }
      }
      return noLongerValidElements;
    }

This procedure computes which elements are no longer valid given the proposed changes. The most pessimistic way to keep the provenance consistent would be to remove all elements that are no longer valid.
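
For readers who wish to experiment with the procedure, the following self-contained sketch fills in the helper functions under explicit assumptions. Elements are identified by strings; an element is treated as invalid when any of its direct inputs is changed or invalid, unless it has been marked "tolerant" (a stand-in for cases such as a model that retains its critical mass of training facts); and the topological order is obtained as a reverse depth-first post-order. None of these choices is prescribed by the description, and the class `PropagationSketch` is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PropagationSketch {
    private final Map<String, List<String>> children = new HashMap<>();
    private final Map<String, List<String>> parents = new HashMap<>();
    private final Set<String> tolerant = new HashSet<>();

    void addEdge(String parent, String child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        parents.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
    }

    // Assumption: a tolerant element stays valid even if an input is invalid.
    void markTolerant(String element) {
        tolerant.add(element);
    }

    List<String> propagateChanges(Set<String> proposedChanges) {
        // Collect the changes and their descendants in topological order.
        Deque<String> ordered = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        for (String change : proposedChanges) {
            visit(change, visited, ordered);
        }
        Set<String> noLongerValid = new LinkedHashSet<>();
        for (String element : ordered) {
            if (proposedChanges.contains(element)) {
                continue; // the changes themselves are not dependents
            }
            boolean inputInvalid = false;
            for (String parent : parents.getOrDefault(element, List.of())) {
                if (proposedChanges.contains(parent) || noLongerValid.contains(parent)) {
                    inputInvalid = true;
                }
            }
            if (inputInvalid && !tolerant.contains(element)) {
                noLongerValid.add(element);
            }
        }
        return new ArrayList<>(noLongerValid);
    }

    // Depth-first search; pushing a node to the front once all of its
    // descendants are done yields a topological order.
    private void visit(String node, Set<String> visited, Deque<String> ordered) {
        if (!visited.add(node)) {
            return;
        }
        for (String child : children.getOrDefault(node, List.of())) {
            visit(child, visited, ordered);
        }
        ordered.addFirst(node);
    }
}
```

Because invalidation stops at a tolerant element, its descendants are not reported unless they are reachable through some other invalid input, which mirrors the "skip links from elements that are not invalidated" comment in the procedure above.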

Processor-Supported Propagation.

Some curation processors may support less pessimistic propagation. For example:

1. A transform processor could re-apply the transform to changed values that were inputs to a transform.

2. A linkage processor that maintains a connected-components structure could do incremental clustering to ensure that the connected-components structure remains consistent.

3. A linkage classifier could be re-applied to a pair of records or attributes, some of whose constituent data had changed, in order to generate a new linkage proposal.

4. A linkage processor could be re-applied to some subset of the data, where a significant number of constituent values had changed, to generate a new set of linkage proposals.

5. A linkage model rendered invalid by changes might be re-generated using new input from Data Experts 120 and System Operator 130 and re-applied to relevant data to generate updated confidences that would support bulk-approved provenance.

In this case, the no-longer-valid elements would not be removed from the curation state, but instead updates to them could be proposed, possibly with confidence scores, which could then be presented to System Operator 130 for approval.

State History Actions.

Similar to the Git source control system, in this curation system, it is possible to branch from previous system states and to merge two independent branches of work. Such actions are performed by System Operator 130 as one or more of State History Action 290 via State Creation And Manipulation Module 200.

FIG. 3 illustrates basic state history.

Branching.

Branching is the simplest of these operations. To branch, the user specifies an identifier for the already existing state from which he/she wants to continue working. Future states are then created as children of the given state.

FIG. 4 illustrates state branching.

Merging Independent Branches of Work.

To process a MERGE action, designate the two branches being merged as (A) and (B). One of the branches will be used as the base of the merge. If the user has specified which one, then use that one. If not, then use the one that is a deeper descendant of the least common ancestor of the tips of A and B. If both are equally deep descendants, then choose whichever was created last.

Without loss of generality (WLOG), suppose that A is the base of the merge. Then, to merge A and B, starting with the action of B originating at the least common ancestor of A and B, replay all actions in B in sequence in PROPAGATE dependency processing mode (DPM).
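
The base selection and replay order can be sketched as follows. The sketch assumes that each state on the two branches has a single parent, that state identifiers increase with creation time (so the larger identifier was "created last"), and that each state records the name of the action that produced it. It only plans the merge; replaying each action in PROPAGATE mode is left out. The classes `StateNode` and `MergePlanner` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical node in the state history graph.
class StateNode {
    final int id;            // assumed to increase with creation time
    final StateNode parent;  // null for the initial state
    final String action;     // action that produced this state from its parent

    StateNode(int id, StateNode parent, String action) {
        this.id = id;
        this.parent = parent;
        this.action = action;
    }
}

class MergePlanner {
    static StateNode leastCommonAncestor(StateNode a, StateNode b) {
        Set<StateNode> ancestorsOfA = new HashSet<>();
        for (StateNode s = a; s != null; s = s.parent) {
            ancestorsOfA.add(s);
        }
        for (StateNode s = b; s != null; s = s.parent) {
            if (ancestorsOfA.contains(s)) {
                return s;
            }
        }
        return null; // no shared history
    }

    // Number of actions between the ancestor and the branch tip.
    static int depthBelow(StateNode ancestor, StateNode tip) {
        int depth = 0;
        for (StateNode s = tip; s != null && s != ancestor; s = s.parent) {
            depth++;
        }
        return depth;
    }

    // Deeper descendant of the least common ancestor wins; ties go to the
    // tip that was created last.
    static StateNode chooseBase(StateNode a, StateNode b) {
        StateNode lca = leastCommonAncestor(a, b);
        int depthA = depthBelow(lca, a);
        int depthB = depthBelow(lca, b);
        if (depthA != depthB) {
            return depthA > depthB ? a : b;
        }
        return a.id > b.id ? a : b;
    }

    // Actions of the non-base branch, oldest first, starting just after the
    // least common ancestor.
    static List<String> actionsToReplay(StateNode base, StateNode other) {
        StateNode lca = leastCommonAncestor(base, other);
        List<String> actions = new ArrayList<>();
        for (StateNode s = other; s != null && s != lca; s = s.parent) {
            actions.add(s.action);
        }
        Collections.reverse(actions);
        return actions;
    }
}
```

When the user specifies the base explicitly, `chooseBase` would simply be bypassed.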

FIG. 5 illustrates state branch merging. In FIG. 5, state 7 embodies the merge of actions 3 and 5 with actions 2 and 4.

Rebase, which is similar to merge, is shown in FIG. 6. In FIG. 6, state 8 embodies the rebase (where each action is incrementally reapplied and states are created for each action) of actions 3 and 5 with actions 2 and 4.

Other Embodiments

In another embodiment, curation state contains only a portion of the actual data source (rather than the entire data source itself), and the rest of the data source is represented by a data source identifier, such as a pointer or link.

In another embodiment, Curation Process Module 110 drives curation with only guidelines from System Operator 130, rather than explicit commands/actions. This may optionally be embodied as a Curation Supervisor module (not shown). For example, System Operator 130 may provide a desired level of accuracy in linkage as well as a set of desired formats for the data, and the system will take steps as appropriate using System Operator 130's authorization. The system may encounter points where it cannot proceed without System Operator 130's guidance. At these points, the system may notify System Operator 130 via synchronous communication, or by posting a message asking for guidance and help to a queue or storage medium that System Operator 130 may access asynchronously. The system may also expose a dashboard user interface, via which System Operator 130 can inspect system state and view blockages encountered by the system, or specific points where the system isn't blocked but where input from System Operator 130 could make a large difference to the output.

In another embodiment, Structured Queries 140 supports analytics and data-mining operations including, for example:

1. Which parts of an organization's data have strong vs. weak provenance?

2. How well-annotated is the data that comes from different parts of the organization?

3. How widely-used and/or connected is the data from various projects?

4. What is the performance and/or contribution level of individual data experts or system operators?

In another embodiment, the system supports exploratory curation and what-if scenarios including, for example:

1. Suppose an attribute looks like it contains phone number data. What happens if System Operator 130 marks it as such?

2. Suppose some records look like they refer to the same customer. What happens if System Operator 130 links them?

3. If an acceptance threshold is set to 80%, then what will any false-positives look like? What about 85%? 90%?

4. What happens if different instructions are given to Data Experts 120?

In another embodiment, Update Handler Module 230 includes additional features, such as:

1. A setting that affects the degree of pessimism with which state is declared invalid.

2. Leaving updates as unresolved, and allowing curation to proceed, while in the background performing computations or polls of Data Experts 120 that provide evidence on which to base higher-confidence proposals.

In another embodiment, version-based storage of state in Curation States And Provenance Datastore 220 enables publishing events via an event queueing system (such as an enterprise event bus). As new states are created, the state changes in those states get added to the queue. If downstream systems are not able to process reversals of linkage but instead are able to process full reloads, then it is possible to temporarily give invalidated linkage facts explicit provenance in between reloads, and then to provide periodic snapshots for full reload.

In another embodiment, the functions of Update Handler Module 230 and State Creation And Manipulation Module 200 are combined into a combined computer module.

It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. For example, the components of the system (including Curation Process Module 110, State Creation And Manipulation Module 200, Update Handler Module 230, and Curation States And Provenance Datastore 220) can be implemented on various computer hardware platforms (including physical, networked, virtual, and cloud) using various computer software programming languages. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
1. A method of provenance creation, tracking, and re-use as part of improved curation of large scale data sets, comprising:

operating software on a computer system for data curation, the software performing data curation actions of data loading, transformation, and linkage;

wherein data loading comprises: identifying a new data source external to the operating computer system, wherein the data source comprises 8,000 or more records, and each record is a key-value pair, wherein every key corresponds to an attribute or column of the new data source; and loading the records of the new data source into storage accessed by the operating computer system, wherein storage is within a large-scale relational database management system or Hadoop cluster;

wherein linkage comprises: posing linkage questions to data experts; obtaining linkage opinions from data experts; generating linkage suggestions based on machine learning of a linkage model; establishing linkage facts by explicit approval of a linkage suggestion by a system operator, or bulk approval based on system operator configured criteria balancing linkage model confidence and data expert opinion consensus; wherein linkage further comprises attribute linkage for schema mapping between different data sources, and record linkage for deduplication; and wherein a linkage fact identifies two attributes or two different records which are linked as equivalent;

wherein transformation comprises: applying a transformation script language or extract, transform, load (ETL) tools to create new attributes derived from existing attributes, or modify existing attributes and records;

wherein each data curation action comprises the following steps: initiation by the system operator or scheduled task configured by the system operator; action processing to produce a Curation Proposal comprising a set of suggested changes and confidence for each suggestion; presenting the curation proposal to the system operator, wherein the system operator may approve or reject individual suggestions within the curation proposal or apply a selection criteria to approve or reject suggestions in bulk, forming candidate changes as the set of accepted suggestions from the curation proposal; computing an updated proposal by propagating dependencies from the candidate changes; approving or rejecting changes within the updated proposal based on a dependency processing mode (DPM) selected by the system operator, and creating final changes as the changes approved based on the DPM; and applying the final changes to a current curation state to create a new curation state;

wherein each curation state includes: one or more data sources; one or more linkage facts about attributes and records of the data sources; a linkage system state comprising training data, linkage models, and any intermediate states computed during generation of linkage suggestions; one or more linkage questions; one or more linkage opinions;

wherein each curation state may be stored independent of a previous curation state or as a set of changes from the previous curation state;

for every curation state change, recording provenance metadata about the change, wherein provenance metadata comprises: which curation data action occurred to cause the change; which system operator took the curation data action causing the change; what conditions are required for the change to remain valid, wherein linkage actions remain valid based on: explicit approval actions by a system operator remain valid until explicit approval is removed; bulk approval actions remain valid as long as criteria for bulk approval remains met; linkage models based on training data remain valid while a configured critical mass of training facts remain valid; and transformation actions remain valid as long as inputs to the transformation action remain valid;

forming a directed acyclic graph (DAG) based on recorded provenance, where each node of the DAG without any in-edges represents a curation action, and each node with in-edges represents an individual curation state element;

traversing the DAG to determine which curation state elements depend from an individual element;

using the DAG to determine invalidated dependencies by calculating the set of descendants of all proposed changes when propagating changes to a curation state;

wherein the DPM applied to approve or reject changes in the updated proposal is selected from restrict, propagate, or override, and wherein restrict rejects any changes that change any existing facts, propagate uses provenance information of any elements being changed to compute further changes whose application makes the provenance of all elements consistent, and override updates any elements rendered invalid by the candidate changes to being approved by the system operator;

selecting a prior curation state to create a branch, updating the current curation state to the prior curation state, and tracking all curation state changes in the branch as children from the prior curation state; and

merging two different branches by: identifying which branch is a deeper descendant from a least common ancestor curation state of both branches, and using the identified branch as a base; and applying, in sequential order from the least common ancestor curation state, all curation state changes in the non-identified branch to the identified branch using a propagate DPM.